Programming with CUDA
Jens K. Mueller [email protected]
Department of Mathematics and Computer Science Friedrich-Schiller-University Jena
Tuesday 31st May, 2011

Today's lecture: Analyzing Parallel Algorithms

Sequential Machine Models
Well-established models of sequential computability:
- Turing machine
- Random Access Machine (RAM)
- µ-recursive functions
- λ-calculus

All these notions of computability are Turing equivalent.
Church–Turing thesis: "Every function that is computable is computable by a Turing machine."
2011-05-31 – CUDA 3 / 20

Parallel Machine Models
- Directed acyclic graph
- Network models: CPUs directly connected
- Parallel Random Access Machine (PRAM): RAMs connected only via an infinite shared memory
  - Exclusive Read Exclusive Write (EREW)
  - Concurrent Read Exclusive Write (CREW)
  - Concurrent Read Concurrent Write (CRCW)
    - Common Concurrent Read Concurrent Write
    - Arbitrary Concurrent Read Concurrent Write
    - Priority Concurrent Read Concurrent Write
    - Associative Concurrent Read Concurrent Write
- ...

An algorithm analysis for one model may not carry over to another model.

Algorithm Analysis
- Run-time analysis
- Space analysis

Both are analyzed as functions of the input size.

- Worst case
- Average case
- Best case
- Amortized: average time per operation over a worst-case sequence of operations
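The amortized case can be illustrated with a dynamic array that doubles its capacity whenever it is full. The following Python sketch (the helper `append_cost` is hypothetical, introduced only for this illustration) counts the total number of element writes for n appends; the total stays below 3n, so the amortized cost per append is constant even though a single append may copy the whole array.

```python
def append_cost(n):
    """Simulate appending n items to a dynamic array that doubles
    its capacity when full; return the total number of element writes."""
    capacity, size, writes = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            writes += size      # copy all existing elements to the new array
            capacity *= 2
        size += 1
        writes += 1             # write the new element itself
    return writes

# Total writes stay below 3n, so the amortized cost per append is O(1),
# although the worst single append copies n/2 elements.
print(append_cost(1024) / 1024)
```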
Big O Notation
- f(n) ∈ O(g(n)): there exist constants c > 0 and n0 such that f(n) ≤ c · g(n) for all n > n0. f is asymptotically bounded above by g (up to a constant factor).
- f(n) ∈ Ω(g(n)): there exist constants c > 0 and n0 such that f(n) ≥ c · g(n) for all n > n0. f is asymptotically bounded below by g (up to a constant factor).
Big O Notation (cont.)
- f(n) ∈ Θ(g(n)): there exist constants c1, c2 > 0 and n0 such that c1 · g(n) ≤ f(n) ≤ c2 · g(n) for all n > n0. f is asymptotically bounded above and below by g (up to constant factors). Shorthand for: f(n) ∈ O(g(n)) and f(n) ∈ Ω(g(n)).
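To make the constants concrete, here is a small Python spot check (a finite check over a range of n, not a proof of the asymptotic claim) for the example f(n) = 3n² + 5n ∈ Θ(n²), with witnesses c2 = 4, c1 = 3, and n0 = 5:

```python
def bounded_above(f, g, c, n0, n_max=10_000):
    """Check f(n) <= c * g(n) for all n0 < n <= n_max.
    A finite spot check only, not a proof of the asymptotic bound."""
    return all(f(n) <= c * g(n) for n in range(n0 + 1, n_max + 1))

f = lambda n: 3 * n * n + 5 * n        # f(n) = 3n^2 + 5n
g = lambda n: n * n                    # g(n) = n^2

# Upper bound: 3n^2 + 5n <= 4n^2 holds exactly when n >= 5.
print(bounded_above(f, g, 4, 5))
# Lower bound: 3n^2 <= 3n^2 + 5n holds trivially for all n > 0.
print(all(3 * g(n) <= f(n) for n in range(1, 10_000)))
```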
Parallel RAM
- Infinitely many processors
- Infinite memory
- Constant-time random memory access
- Processor clocks are synchronized
- Instructions: read, compute, write
- Run-time analysis depends on the input size and the number of processors
Running Time
Definition (Sequential Running Time)
Given a problem P, T is the best known running time of a sequential algorithm that solves P.

Definition (Parallel Running Time)
Given a problem P, Tp is the best known running time of a parallel algorithm that solves P using p processors.
Parallel Speedup

Definition (Parallel Speedup)
Sp := T / Tp
where T is the running time of the best known sequential algorithm and Tp is the running time of a parallel algorithm using p processors.

Lemma
Sp ≤ p
The parallel speedup can never be larger than the number of processors. The ideal speedup is p (linear speedup: p times faster with p times more processors).
Parallel Efficiency

Definition (Parallel Efficiency)
The efficiency of a parallel algorithm is
Ep := Sp / p
Corollary
Ep ≤ 1
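These two definitions can be written down directly in Python; the timings below (12 s sequentially, 2 s on 8 processors) are hypothetical values chosen only for illustration:

```python
def speedup(t_seq, t_par):
    """Parallel speedup S_p = T / T_p."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Parallel efficiency E_p = S_p / p."""
    return speedup(t_seq, t_par) / p

# Hypothetical timings: 12 s sequentially, 2 s on 8 processors.
s = speedup(12.0, 2.0)        # S_p = 6: six times faster
e = efficiency(12.0, 2.0, 8)  # E_p = 0.75: each processor is 75% utilized
print(s, e)
```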
Parallel Cost

Definition (Parallel Cost)
The cost of a parallel algorithm is
Cp := p · Tp
i.e., the product of the number of processors and the parallel running time.

Lemma
The cost of a parallel algorithm is at least the sequential running time:
T ≤ Cp
Parallel Work

Definition (Parallel Work)
The work W of a parallel algorithm is the total number of operations executed over all processors.

Lemma
W ≤ Cp

Proof.
Let Wi be the number of operations executed in time step i. Since at most p processors operate in each step, Wi ≤ p, and hence
W = ∑_{i=1}^{Tp} Wi ≤ ∑_{i=1}^{Tp} p = p · Tp = Cp.
Work Optimality
Definition (Work optimal)
A parallel algorithm is work optimal if W = T.
Cost Optimality
Definition (Cost optimal)
A parallel algorithm is cost optimal if Cp = T.

Corollary
For a cost-optimal parallel algorithm it follows that Sp = p and Ep = 1. Cost optimality implies linear speedup.

Example (Summation)
- Sequential summation is cost optimal, since C1 = 1 · T = 1 · n = O(n).
- Parallel summation using a binary tree with n processors is not cost optimal, since Cn = n · log n = O(n · log n) ≠ O(n).
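The binary-tree summation can be sketched in Python to count its parallel steps. Each iteration of the loop corresponds to one PRAM step in which all pairwise additions happen simultaneously, so n = 1024 values need log2(n) = 10 steps and the cost with n processors is n · log2(n), not O(n):

```python
def tree_sum_steps(values):
    """Pairwise (binary-tree) reduction; returns the sum and the
    number of parallel steps, i.e. the number of tree levels."""
    steps = 0
    while len(values) > 1:
        # One parallel step: on a PRAM all pairs are added simultaneously.
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            pairs.append(values[-1])  # odd element carried to the next level
        values = pairs
        steps += 1
    return values[0], steps

total, steps = tree_sum_steps(list(range(1024)))
# With n = 1024 processors: T_n = log2(n) = 10 steps,
# so C_n = n * log2(n) = 10240, asymptotically larger than T = O(n).
print(total, steps)
```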
Sorting

- T = O(n log n)
- Cost-optimal algorithms:
  - p = O(n) and Tp = O(log n)
  - p = O(n / log n) and Tp = O(log² n)
  - Using O(n²) processors with Tp = O(1) is not cost optimal.
- Cost optimality takes the number of processors used into account. This is important when only few processors are available.
Brent's Theorem

Theorem (Brent)
A parallel algorithm using p processors with time Tp and work W can be executed on p' < p processors in time
Tp' = Tp + ⌊W / p'⌋.
Brent's Theorem allows changing the number of processors for a given parallel algorithm.

Example (Parallel Reduction (using a binary tree))
Tp = O(log n). Set p' = O(n / log n). Then
Tp' = O(log n) + ⌊O(n) / O(n / log n)⌋ = O(log n) + O(log n) = O(log n).
Then the parallel reduction is cost optimal, i.e.
Cp' = O(log n) · O(n / log n) = O(n).
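Brent's rescheduling can be checked numerically: simulate a tree reduction in which each level's additions are executed in ⌈ops / p'⌉ rounds on p' processors. The helper `brent_steps` below is an illustrative sketch, not a standard library function; with p' = n / log2(n) processors the total step count stays within 2 · log2(n), i.e. still O(log n):

```python
import math

def brent_steps(n, p):
    """Time steps to reduce n values with p processors, scheduling
    each tree level's additions in ceil(ops / p) rounds (Brent)."""
    steps = 0
    while n > 1:
        ops = n // 2                   # additions at this tree level
        steps += math.ceil(ops / p)    # rounds needed with only p processors
        n = math.ceil(n / 2)
    return steps

n = 1 << 16                            # 65536 values, log2(n) = 16
p = n // int(math.log2(n))             # p' = n / log2(n) = 4096 processors
# The reduction still finishes in O(log n) steps on p' processors,
# so its cost p' * T_p' is O(n): the reduction becomes cost optimal.
print(brent_steps(n, p))
```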
Amdahl's Law

Assume T = 1 and a fraction x of the sequential algorithm can be parallelized using p processors. Then
Tp = (1 − x) + x/p
where (1 − x) is the sequential part and x/p the parallel part. The overall speedup is
Sp = 1 / ((1 − x) + x/p)

- Based on a fixed input size
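A quick numeric illustration of the formula in Python (the parameter values are chosen only for illustration): even with arbitrarily many processors the speedup is capped at 1/(1 − x), e.g. at 10 for x = 0.9.

```python
def amdahl_speedup(x, p):
    """Amdahl's law: overall speedup with parallelizable fraction x
    of the work executed on p processors."""
    return 1.0 / ((1.0 - x) + x / p)

# For x = 0.9 the speedup approaches, but never reaches, 1/(1 - x) = 10,
# no matter how many processors are used.
for p in (1, 16, 1024, 65536):
    print(p, round(amdahl_speedup(0.9, p), 2))
```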
Amdahl's Law (cont.)

Figure: Amdahl's Law, plotting speedup against the number of processors (1 to 65536) for parallel fractions x = 0.5, 0.6, 0.7, 0.8, and 0.9.
Questions?