Programming with CUDA

Jens K. Mueller [email protected]

Department of Mathematics and Computer Science, Friedrich Schiller University Jena

Tuesday 31st May, 2011

Today's lecture: Analyzing Parallel Algorithms

Sequential Machine Models

Well-established sequential models of computability:

- Turing machine

- Random Access Machine (RAM)

- µ-recursive functions

- λ-calculus

All these notions of computability are Turing equivalent.

Church hypothesis: "Every function that is computable is computable by a Turing machine."

Parallel Machine Models

- Directed acyclic graph

- Network models: CPUs are directly connected

- Parallel Random Access Machine (PRAM): RAMs connected only via an infinite shared memory

  - Exclusive Read Exclusive Write (EREW)

  - Concurrent Read Exclusive Write (CREW)

  - Concurrent Read Concurrent Write (CRCW)

    - Common CRCW (all simultaneous writers must write the same value)
    - Arbitrary CRCW (an arbitrary writer succeeds)
    - Priority CRCW (the writer with the highest priority succeeds)
    - Associative CRCW (simultaneous writes are combined with an associative operator)

- ...

Algorithm analysis for one model may not carry over to another model.

Algorithm Analysis

- Running-time analysis

- Space analysis

Both are given in dependence on the input size.

- Worst case

- Average case

- Best case

- Amortized: average time per operation over a worst-case sequence of operations

Big O Notation

- $f(n) \in O(g(n))$: there exist constants $c > 0$ and $n_0$ such that $f(n) \le c \cdot g(n)$ for all $n > n_0$. $f$ is asymptotically bounded above by $g$ (up to a constant factor).

- $f(n) \in \Omega(g(n))$: there exist constants $c > 0$ and $n_0$ such that $f(n) \ge c \cdot g(n)$ for all $n > n_0$. $f$ is asymptotically bounded below by $g$ (up to a constant factor).

Big O Notation (cont.)

- $f(n) \in \Theta(g(n))$: there exist constants $c_1, c_2 > 0$ and $n_0$ such that $c_1 \cdot g(n) \le f(n) \le c_2 \cdot g(n)$ for all $n > n_0$. $f$ is asymptotically bounded above and below by $g$ (up to constant factors). Shorthand for: $f(n) \in O(g(n))$ and $f(n) \in \Omega(g(n))$.
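As an added worked example (not from the slides): $3n^2 + 5n \in \Theta(n^2)$, since choosing $c_1 = 3$, $c_2 = 4$, and $n_0 = 5$ gives $3n^2 \le 3n^2 + 5n \le 4n^2$ for all $n > n_0$, because $5n \le n^2$ once $n \ge 5$.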

Parallel RAM

- Infinitely many processors

- Infinite shared memory

- Random access to memory in constant time

- The processors' clocks are synchronized

- Instructions: Read, Compute, Write

- Run-time analysis depends on the input size and the number of processors
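To relate this to CUDA (an added sketch, not from the slides): the synchronous PRAM step Read, Compute, Write maps naturally onto a kernel in which every thread reads its operands, computes, and writes one result. Kernel and variable names are illustrative; note that, unlike PRAM processors, CUDA threads are not globally synchronized after every instruction.

__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    // One thread plays the role of one PRAM processor.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];   // read
        float y = b[i];   // read
        c[i] = x + y;     // compute and write
    }
}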

Running Time

Definition (Sequential Running Time)
Given a problem P. Then $T$ is the running time of the best known sequential algorithm that solves P.

Definition (Parallel Running Time)

Given a problem P. Then $T_p$ is the running time of the best known parallel algorithm using p processors that solves P.

Parallel Speedup

Definition (Parallel Speedup)

$$S_p := \frac{T}{T_p}$$

where $T$ is the running time of the best known sequential algorithm and $T_p$ is the running time of a parallel algorithm using p processors.

Lemma

$S_p \le p$

The parallel speedup can never be larger than the number of processors: a single processor can simulate one parallel step of p processors in at most p steps, so $T \le p \cdot T_p$. Ideal speedup is p (linear speedup: p times faster with p times as many processors).

Parallel Efficiency

Definition (Parallel Efficiency)
The efficiency of a parallel algorithm is
$$E_p := \frac{S_p}{p}$$

Corollary

$E_p \le 1$
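As an added illustration (using the summation example from a later slide): summing n numbers sequentially takes $T = O(n)$, while the binary-tree summation with $p = n$ processors takes $T_p = O(\log n)$; hence $S_p = O(n / \log n)$ and $E_p = O(1 / \log n)$, i.e. the efficiency drops towards 0 as n grows.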

Parallel Cost

Definition (Parallel Cost)
The cost of a parallel algorithm is

$$C_p := p \cdot T_p$$

i.e., the product of the parallel running time and the number of processors.

Lemma
The cost of a parallel algorithm is at least the sequential running time:

$T \le C_p$

(One processor can simulate the p-processor algorithm step by step in time $p \cdot T_p = C_p$; the best sequential algorithm is at least as fast.)

Parallel Work

Definition (Parallel Work)
The work $W$ of a parallel algorithm is the total number of executed operations over all processors.

Lemma

$W \le C_p$

Proof.
Let $W_i \le p$ be the number of operations executed in time step $i$. Then
$$W = \sum_{i=1}^{T_p} W_i \le \sum_{i=1}^{T_p} p = p \cdot T_p = C_p.$$

Work Optimality

Definition (Work optimal)
A parallel algorithm is work optimal if $W = T$.

Cost Optimality

Definition (Cost optimal)

A parallel algorithm is cost optimal if $C_p = T$.

Corollary

For a cost optimal parallel algorithm it follows that $S_p = p$ and $E_p = 1$. Cost optimality implies linear speedup.

Example (Summation)

- Sequential summation is cost optimal, since $C_1 = 1 \cdot T = 1 \cdot n = O(n)$

- Parallel summation (using a binary tree with $p = n$ processors) is not cost optimal, since $C_n = n \cdot \log n = O(n \log n) \ne O(n)$; see the sketch below
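The following is a minimal CUDA sketch (added for illustration, not from the slides) of the binary-tree summation as a block-level reduction: one thread plays the role of one PRAM processor, so $p = n$ threads run $O(\log n)$ steps, giving the non-cost-optimal $C_n = O(n \log n)$ discussed above. Kernel and variable names are illustrative; n is assumed to equal the block size and to be a power of two.

__global__ void treeSum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each of the n "processors" (threads) loads one element.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // log n tree steps; half of the active threads become idle in each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the (partial) sum of this block.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

For a single block (n up to the maximum block size) this could be launched as treeSum<<<1, n, n * sizeof(float)>>>(d_in, d_out, n).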

Sorting

- $T = O(n \log n)$

- Cost optimal algorithms:

  - $p = O(n)$ and $T_p = O(\log n)$

  - $p = O(n / \log n)$ and $T_p = O(\log^2 n)$

- Using $O(n^2)$ processors and $T_p = O(1)$ is not cost optimal.

- Cost optimality takes the number of used processors into account. This matters when fewer processors are available.

Brent's Theorem

Theorem (Brent)

A parallel algorithm using p processors with time $T_p$ and work $W$ can be executed on $p' < p$ processors in time
$$T_{p'} = T_p + \left\lfloor \frac{W}{p'} \right\rfloor.$$

Brent's Theorem allows changing the number of processors for a given parallel algorithm.

Example (Parallel Reduction using a binary tree)
$T_p = O(\log n)$ with work $W = O(n)$. Set $p' = O(n / \log n)$. Then
$$T_{p'} = O(\log n) + \left\lfloor \frac{O(n)}{O(n / \log n)} \right\rfloor = O(\log n) + O(\log n) = O(\log n).$$
Then the parallel reduction is cost optimal, i.e.
$$C_{p'} = O(\log n) \cdot O(n / \log n) = O(n).$$
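The following CUDA sketch (added for illustration, not from the slides) shows how the $p' = O(n / \log n)$ idea is typically realized in practice: each thread first sums its share of the input sequentially, and only then the binary tree from the previous sketch is applied to the per-thread partial sums. Kernel and variable names are illustrative.

__global__ void treeSumBrent(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    float local = 0.0f;

    // Sequential part: each of the p' threads performs O(n / p') additions.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local += in[i];

    sdata[tid] = local;
    __syncthreads();

    // Parallel part: O(log p') binary-tree steps over the partial sums.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}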

Amdahl's Law

Assume $T = 1$ and a fraction $x$ of the sequential algorithm can be parallelized using p processors. Then
$$T_p = \underbrace{(1 - x)}_{\text{sequential part}} + \underbrace{\frac{x}{p}}_{\text{parallel part}}.$$
The overall speedup is
$$S_p = \frac{1}{(1 - x) + \frac{x}{p}}.$$

- Based on a fixed input size
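As an added worked example: with $x = 0.9$ and $p = 16$ the speedup is $1 / (0.1 + 0.9/16) = 6.4$; letting $p \to \infty$ gives at most $1 / (1 - x) = 10$, which matches the plateau of the $x = 0.9$ curve in the figure below.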

Amdahl's Law (cont.)

[Plot: speedup (1 to 10) versus number of processors (1 to 65536) for parallel fractions x = 0.5, 0.6, 0.7, 0.8, 0.9]

Figure: Amdahl’s Law

Questions?